gangj commented Jan 6, 2026

Problem: CA-384228 - Xapi fails to start on slave during pool join

When a slave joins a pool with jumbo frames (MTU 9000) configured, xapi can hang
during "Synchronising bonds on slave with master" if the network path doesn't
support the configured MTU.

Root Cause:

  • Management interface configured with MTU=9000 (jumbo frames)
  • Network path actual MTU ~1500 bytes (switches/routers don't support jumbo frames)
  • During pool join, xapi makes RPC calls to master with large requests (~1613 bytes)
  • Packets exceeding path MTU are silently dropped by network infrastructure
  • Without TCP MTU probing enabled, TCP cannot detect the MTU mismatch when ICMP feedback is missing
  • TCP retransmissions fail repeatedly as packets continue to be dropped
  • The RPC call hangs, eventually timing out
  • Application-level retry logic (in master_connection.ml) attempts reconnection
  • Each retry encounters the same MTU issue
  • Meanwhile, the retrying connection holds a database lock
  • Any other DB request from slave to master is blocked waiting for this lock
  • This causes the entire slave xapi to hang, not just the pool join operation
  • Pool join hangs for an extended period before eventually failing
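
To make the ~1613-byte figure concrete, here is a rough arithmetic sketch (not code from this PR). It assumes IPv4 and TCP headers of 20 bytes each with no options, and that both ends derive their MSS from the interface MTU; all names are illustrative only.

```ocaml
(* Illustrative arithmetic only. Assumption: IPv4 and TCP headers without
   options, 20 bytes each. *)
let ipv4_header = 20
let tcp_header = 20

(* Largest TCP payload per segment that the sender believes is safe. *)
let mss_of_mtu mtu = mtu - ipv4_header - tcp_header

(* IP-layer size of the first segment carrying a request of the given size. *)
let first_packet_size ~interface_mtu ~request_bytes =
  min request_bytes (mss_of_mtu interface_mtu) + ipv4_header + tcp_header

(* A packet gets through only if it fits the smallest MTU on the path. *)
let survives ~path_mtu packet_size = packet_size <= path_mtu

let () =
  let request_bytes = 1613 in
  List.iter
    (fun interface_mtu ->
      let size = first_packet_size ~interface_mtu ~request_bytes in
      Printf.printf "interface MTU %d: first packet %d bytes, %s\n"
        interface_mtu size
        (if survives ~path_mtu:1500 size then "delivered" else "dropped"))
    [1500; 9000]
```

With an interface MTU of 1500 the request is split so that the first packet is exactly 1500 bytes. With an interface MTU of 9000 the whole request goes out as one 1653-byte packet, which a 1500-byte path cannot carry.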

Why it happens:
By default, TCP relies on ICMP "Fragmentation Needed" messages to discover path MTU.
When the interface is configured with MTU=9000 but the network path only supports
1500 bytes:

  • If ICMP is working: the router sends an ICMP "Fragmentation Needed" message back to the sender,
    TCP reduces its packet size to fit 1500 bytes, and the connection works fine.

  • If ICMP is blocked (CA-384228 scenario): the router drops large packets, but the ICMP
    message is blocked by firewalls. TCP has no way to discover the mismatch. It keeps
    retransmitting packets larger than the path MTU, which are silently dropped, leading to
    connection hangs and database lock contention.

Before this fix: TCP depends entirely on ICMP for MTU discovery. When ICMP is
blocked, TCP cannot adapt, causing extended hangs and database lock contention.

After this fix: TCP can detect packet loss patterns and proactively reduce packet
size even without ICMP feedback, preventing hangs and allowing pool operations to
complete successfully.

Solution Overview:
This fix has two parts:

  1. Enable TCP Path MTU Discovery (PMTUD) - Allows TCP to automatically detect
    and adapt to the path MTU, preventing hangs
  2. Add diagnostics during pool join - Detects and warns about MTU mismatches
    and creates an alert for customer awareness

Commit 1: CA-384228: Enable TCP Path MTU Discovery by default

Add sysctl configuration to enable TCP PMTUD on all XenServer hosts.
This prevents TCP connection hangs when path MTU is smaller than configured
interface MTU (e.g., jumbo frames configured but network infrastructure
doesn't support them).

How it fixes the hang:
With PMTUD enabled, TCP can now automatically:

  1. Detect packet loss patterns indicating MTU issues
  2. Reduce packet size (MSS) to find working MTU
  3. Continue communication with adjusted packet size
  4. Work even when ICMP is blocked by firewalls

This prevents the database lock contention that causes slave xapi to hang completely.

Configuration:

  • net.ipv4.tcp_mtu_probing=1: MTU probing is off by default and is enabled for a
    connection when an ICMP black hole is detected (recommended setting)
  • net.ipv4.tcp_base_mss=1024: Initial MSS used as the starting point for MTU probing
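
As a rough mental model of what MTU probing achieves (this is not the kernel's actual algorithm), the sketch below searches between the base MSS and the MSS implied by the interface MTU for the largest segment size that still gets through. `probe_mss` and the `delivers` predicate are hypothetical; a real TCP stack infers delivery from retransmission timeouts and acknowledged probes.

```ocaml
(* Illustration only, not the kernel's algorithm. [delivers mss] is a
   hypothetical stand-in for "a segment of this size gets acknowledged".
   Assumption: IPv4 and TCP headers without options, 20 bytes each. *)
let ipv4_header = 20
let tcp_header = 20

let probe_mss ~base_mss ~interface_mtu ~delivers =
  let high = interface_mtu - ipv4_header - tcp_header in
  if delivers high then high (* no black hole: keep the full MSS *)
  else if not (delivers base_mss) then base_mss (* stay at the floor *)
  else
    (* Invariant: [lo] is known to work, [hi] is known to fail. *)
    let rec search lo hi =
      if hi - lo <= 1 then lo
      else
        let mid = lo + ((hi - lo) / 2) in
        if delivers mid then search mid hi else search lo mid
    in
    search base_mss high

let () =
  (* Model a path that silently drops anything larger than 1500 bytes. *)
  let path_mtu = 1500 in
  let delivers mss = mss + ipv4_header + tcp_header <= path_mtu in
  Printf.printf "settled MSS: %d\n"
    (probe_mss ~base_mss:1024 ~interface_mtu:9000 ~delivers)
```

The point of the sketch is that a base MSS of 1024 is small enough to pass a standard 1500-byte path, so a connection that starts from it can make progress and then probe upwards without any ICMP feedback.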

Files:

  • scripts/92-xapi-tcp-mtu.conf: New sysctl configuration file
  • scripts/Makefile: Install sysctl config to /etc/sysctl.d/

The "92" prefix ensures this loads after basic network configuration
(91-net-ipv6.conf) but before local administrator overrides (99-*).

Reference: https://blog.cloudflare.com/path-mtu-discovery-in-practice/


Commit 2: CA-384228: Add MTU diagnostics during pool join

Add diagnostic tests during pool join to detect and warn about MTU
mismatches, particularly when higher MTU values are configured but
the network path doesn't support them.

Why diagnostics are needed:
While TCP PMTUD (commit 1) fixes the hang automatically, customers need
visibility into MTU configuration issues. This creates an alert visible
in XenCenter/CLI when path MTU < configured MTU, prompting infrastructure
fixes to prevent performance degradation.

The diagnostics:

  1. Query master's management network MTU via RPC
  2. Detect VLAN configuration and account for 4-byte overhead
  3. Calculate ICMP payload dynamically: MTU - IP header (20) - ICMP header (8) - VLAN (4 if present)
  4. Test standard MTU (1500) with ICMP ping
  5. Test configured MTU if > 1500
  6. Create pool-level alert when CA-384228 scenario detected:
    • Standard MTU (1500) works
    • Configured higher MTU fails
    • This indicates path MTU < configured MTU
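
The payload calculation and the alert condition above can be sketched as follows. This is a minimal illustration under the header sizes listed in step 3, not the code in this PR; `icmp_payload`, `verdict` and `classify` are hypothetical names.

```ocaml
(* Hypothetical sketch of steps 3 to 6; not the code in this PR. *)
let ipv4_header = 20
let icmp_header = 8
let vlan_tag = 4

(* ICMP payload size to request from ping for a given MTU. *)
let icmp_payload ~mtu ~on_vlan =
  mtu - ipv4_header - icmp_header - (if on_vlan then vlan_tag else 0)

type verdict = Path_ok | Path_mtu_too_small | Inconclusive

(* Only "standard works, configured fails" points at a path MTU problem.
   If the standard probe fails, ICMP may simply be blocked. *)
let classify ~standard_ok ~configured_ok =
  match (standard_ok, configured_ok) with
  | true, true -> Path_ok
  | true, false -> Path_mtu_too_small
  | false, _ -> Inconclusive

let () =
  List.iter
    (fun (mtu, on_vlan) ->
      Printf.printf "MTU %d%s -> ICMP payload %d\n" mtu
        (if on_vlan then " (VLAN)" else "")
        (icmp_payload ~mtu ~on_vlan))
    [(1500, false); (9000, false); (9000, true)] ;
  match classify ~standard_ok:true ~configured_ok:false with
  | Path_mtu_too_small -> print_endline "would raise the pool-level alert"
  | Path_ok | Inconclusive -> print_endline "no alert"
```

For MTU 1500 and 9000 without a VLAN this yields payloads of 1472 and 8972 bytes. Treating a failed standard probe as inconclusive is what keeps the check from blocking pool join when ICMP is filtered.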

Key design decisions:

  • Does NOT block pool join (ICMP may be blocked by firewalls)
  • Queries master's DB via verified RPC (slave's DB not yet synced)
  • Called after certificate exchange with verified connection
  • Creates pool-level alert for customer visibility in XenCenter/CLI
  • Relies on TCP PMTUD (enabled by commit 1) to handle issues automatically
  • Diagnostics are informational only, providing visibility

The implementation dynamically calculates test packet sizes based on
actual configured MTU rather than assuming fixed values, making it
work correctly with any MTU configuration (not just jumbo frames).

Warning format highlights the issue clearly and references the
TCP PMTUD fix that handles it automatically, with guidance for
persistent problems.

(* Test MTU connectivity using ping - ICMP-based, informational only *)
let test_ping size desc =
  try
    let timeout = 3.0 *. 1e9 |> Int64.of_float |> Mtime.Span.of_uint64_ns in

Contributor

Use Mtime.Span.(3 * s) or similar to create a 3s span.

match (standard_ok, jumbo_ok) with
| true, false ->
    (* CA-384228 scenario: standard works but jumbo fails *)
    warn

Contributor

Would it make sense to create an alert such that the customer would see it?

Contributor Author

Thanks for the advice. I think it is a great idea and have added it now. Please review again, thank you.

Contributor Author

The alert shown in XenCenter after pool join:
[image: screenshot of the alert]

- 1472 = 1500 (standard MTU) - 20 (IP header) - 8 (ICMP header)
- 8972 = 9000 (jumbo MTU) - 20 (IP header) - 8 (ICMP header) *)
let standard_mtu_icmp_payload = 1472 in
let jumbo_mtu_icmp_payload = 8972 in

Member

Can we calculate this dynamically based on the actual MTU, e.g. mtu - 28, and then try one probe for MTU=1500 and another for the actually configured MTU?
That way it would work even if the user configures an MTU slightly smaller than 9000.

Also do we need to take the size of the VLAN tag into account when we're on a VLAN?

gangj force-pushed the private/gangj/CA-384228 branch 2 times, most recently from 1d0e0d4 to 0d7e423 on January 8, 2026 09:16

# This is the starting point for MTU probing when enabled

net.ipv4.tcp_mtu_probing = 1
net.ipv4.tcp_base_mss = 1024

Member

Configuration files have been a recurring source of issues regarding user configuration in xcp-ng. I'm asking the platform teams whether this change follows their recommendations.

Contributor Author

Thanks @psafont, I understand your point.
Could you please share more about the good practices or recommendations? Thank you.

Member

I'm expecting somebody from xcp-ng's platform team to share them here

I'm not from the platform team, but indeed, system configuration files should likely be part of packages like, on the xcp-ng side, xcp-ng-release.

  • Do we actually need a config file to enforce that?
  • Could it be a setting that xapi applies instead? Or could that not be applied early enough for slaves to be able to reach the master?

Contributor

What happens on xapi updates when this file has been changed manually? Would the next xapi update overwrite any change? We might want to add a line instructing not to change this file.

Contributor

If you manually edit a file in /etc/sysctl.d/ that is owned by a package, whether an update overwrites it is controlled by how the package marked that file:

%config(noreplace) → your edited file is kept, and the new packaged version is written as .rpmnew.
%config (without noreplace) → your edited file may be replaced, and your previous version is saved as .rpmsave.

Runtime precedence: When sysctl --system is run, settings from /etc/sysctl.d override vendor defaults from /usr/lib/sysctl.d. The last assignment wins if the same key is set multiple times.

Contributor

Does this mean we should write this to /usr/lib/sysctl.d so that a user can put an override into /etc/sysctl.d/?

Contributor Author

The current implementation installs via xapi's Makefile to /etc/sysctl.d/ for these reasons:

  1. Tight coupling: This sysctl directly fixes xapi's CA-384228 hang issue. The diagnostic
    code added in commit 2 assumes TCP PMTUD is enabled.
  2. Immediate availability: Installed with xapi, no dependency on separate package coordination.
  3. Standard location: /etc/sysctl.d/ with priority 92 (after system defaults like 91-*,
    before admin overrides like 99-*). Admins can override with higher-priority files if needed.
  4. Precedent in xapi: xapi's Makefile already installs various system configuration files.
    This follows the same pattern.

However, I understand the concerns about package ownership and platform-wide impact.

Alternative: We could move to /usr/lib/sysctl.d/92-xapi-tcp-mtu.conf instead, which is:

  • Package-owned location (not user config space)
  • Still loads at the right priority
  • Admin overrides in /etc/sysctl.d/ would automatically take precedence
  • More aligned with package management best practices

I'm happy to hear the preference from the platform team or other recommendations. Thanks!

Contributor

I read points 1, 2, 3 and 4, and pardon me if I'm imagining things, but it looks like they are justifications made by an LLM.

I'm not going to comment on them one by one but will instead try to think about the big picture.

So the main questions are:

  1. Whether to make the configuration file belong to the XAPI package or something else. It is required by XAPI for some functionality, and doesn't modify an existing system file, so in my opinion it can make sense to package it along with XAPI or xcp-networkd.
  2. How to manage potential user modifications to the file. Here I think we want the XAPI-provided file to not be altered at all, so that we can provide future updates to it. As @lindig mentioned, there's a directory for vendor defaults, so it seems that using /usr/lib/sysctl.d/ would be appropriate, if we can ensure that the files are loaded in the right order. If for some reason a user had to override some settings, they could drop a file in /etc.

@ydirson Does this look good to you?

Contributor

Yes, 100%

bleader commented Jan 8, 2026

The PR description has " - <title>" and the description " - <another_title>" and the link is not clear for someone outside XenServer. I assume the PR title reference the original issue this fixes and the description one is the solution found to fix it?

gangj commented Jan 8, 2026

The PR description has " - <title>" and the description " - <another_title>" and the link is not clear for someone outside XenServer. I assume the PR title reference the original issue this fixes and the description one is the solution found to fix it?

Sorry, I have added detailed context for the fix to the description. Please review it again, thanks.

gangj added 2 commits January 9, 2026 18:23

Fixes slave xapi hang during pool join when jumbo frames are configured
but the network path doesn't support them.

Problem:
When MTU mismatch occurs (interface configured for 9000 but path supports
only 1500), RPC connections hang on large requests (~1613 bytes). The
hanging connection holds a database lock, blocking all other DB operations
and causing the entire slave xapi to become unresponsive during pool join.

Root cause:
Without Path MTU Discovery, TCP cannot detect when the path MTU is smaller
than the configured interface MTU. When ICMP "Fragmentation Needed" messages
are blocked by firewalls, TCP has no feedback mechanism to reduce packet size.
Packets exceeding the path MTU are silently dropped by network infrastructure,
leading to connection timeouts. The application-level retry logic (in
master_connection.ml) attempts reconnection, but each retry encounters the
same issue while holding a database lock, causing extended hangs.

Solution:
Enable TCP PMTUD to allow automatic MTU detection and adaptation.

Configuration:
- net.ipv4.tcp_mtu_probing=1: Enable automatic MTU detection when ICMP
  blackhole is detected (recommended setting)
- net.ipv4.tcp_base_mss=1024: Base MSS for MTU probing

With PMTUD enabled, TCP detects packet loss patterns indicating MTU issues
and proactively reduces packet size to find a working MTU. This works even
when ICMP Fragmentation Needed messages are blocked by firewalls, allowing
connections to succeed and preventing database lock contention.

Files:
- scripts/92-xapi-tcp-mtu.conf: New sysctl configuration file
- scripts/Makefile: Install sysctl config to /etc/sysctl.d/

The "92" prefix ensures this loads after basic network configuration
(91-net-ipv6.conf) but before local administrator overrides (99-*).

Reference: https://blog.cloudflare.com/path-mtu-discovery-in-practice/

Signed-off-by: Gang Ji <[email protected]>

Add diagnostic tests during pool join to detect and warn about MTU
mismatches, particularly when higher MTU values are configured but
the network path doesn't support them.

While TCP PMTUD (enabled in previous commit) fixes the hang automatically,
this provides visibility into MTU configuration issues so customers can
fix their network infrastructure.

The diagnostics:
1. Query master's management network MTU via RPC
2. Detect VLAN configuration and account for 4-byte overhead
3. Calculate ICMP payload dynamically:
   MTU - IP header (20) - ICMP header (8) - VLAN (4 if present)
4. Test standard MTU (1500) with ICMP ping
5. Test configured MTU if > 1500
6. Create pool-level alert when CA-384228 scenario detected:
   - Standard MTU (1500) works
   - Configured higher MTU fails
   - This indicates path MTU < configured MTU

Key design decisions:
- Does NOT block pool join (ICMP may be blocked by firewalls)
- Queries master's DB via verified RPC (slave's DB not yet synced)
- Called after certificate exchange with verified connection
- Creates pool-level alert for customer visibility in XenCenter/CLI
- Relies on TCP PMTUD (enabled by previous commit) to prevent hangs
- Diagnostics are informational only, providing visibility

The implementation dynamically calculates test packet sizes based on
actual configured MTU rather than assuming fixed values, making it
work correctly with any MTU configuration (not just jumbo frames).

Warning format highlights the issue clearly and references the
TCP PMTUD fix that handles it automatically, with guidance for
infrastructure improvements.

Signed-off-by: Gang Ji <[email protected]>

gangj force-pushed the private/gangj/CA-384228 branch from 0d7e423 to dcaf8d9 on January 9, 2026 10:30